{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "9122e4b9-4883-4e6e-940b-ab44a70f0951",
   "metadata": {},
   "source": [
    "# How to load documents from a directory\n",
    "\n",
    "LangChain's [DirectoryLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.directory.DirectoryLoader.html) implements functionality for reading files from disk into LangChain [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html#langchain_core.documents.base.Document) objects. Here we demonstrate:\n",
    "\n",
    "- How to load from a filesystem, including use of wildcard patterns;\n",
    "- How to use multithreading for file I/O;\n",
    "- How to use custom loader classes to parse specific file types (e.g., code);\n",
    "- How to handle errors, such as those due to decoding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "1c1e3796-bee8-4882-8065-6b98e48ec53a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders import DirectoryLoader"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e3cdb7bb-1f58-4a7a-af83-599443127834",
   "metadata": {},
   "source": [
    "`DirectoryLoader` accepts a `loader_cls` kwarg, which defaults to [UnstructuredLoader](/docs/integrations/document_loaders/unstructured_file). [Unstructured](https://unstructured-io.github.io/unstructured/) supports parsing for a number of formats, such as PDF and HTML. Here we use it to read in a markdown (.md) file.\n",
    "\n",
    "We can use the `glob` parameter to control which files to load. Note that here it doesn't load the `.rst` file or the `.html` files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "bd2fcd1f-8286-499b-b43a-0c17084ae8ee",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "20"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "loader = DirectoryLoader(\"../\", glob=\"**/*.md\")\n",
    "docs = loader.load()\n",
    "len(docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "9ff1503d-3ac0-4172-99ec-15c9a4a707d8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Security\n",
      "\n",
      "LangChain has a large ecosystem of integrations with various external resources like local\n"
     ]
    }
   ],
   "source": [
    "print(docs[0].page_content[:100])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8b1cee8-626a-461a-8d33-1c56120f1cc0",
   "metadata": {},
   "source": [
    "## Show a progress bar\n",
    "\n",
    "By default a progress bar will not be shown. To show a progress bar, install the `tqdm` library (e.g. `pip install tqdm`), and set the `show_progress` parameter to `True`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "cfa48224-5d02-4aa7-93c7-ce48241645d5",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 54.56it/s]\n"
     ]
    }
   ],
   "source": [
    "loader = DirectoryLoader(\"../\", glob=\"**/*.md\", show_progress=True)\n",
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e02c922-6a4b-48e6-8c46-5015553eafbe",
   "metadata": {},
   "source": [
    "## Use multithreading\n",
    "\n",
    "By default the loading happens in one thread. In order to utilize several threads set the `use_multithreading` flag to true."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "aae1c580-6d7c-409c-bfc8-3049fa8bdbf9",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = DirectoryLoader(\"../\", glob=\"**/*.md\", use_multithreading=True)\n",
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5add3f54-f303-4006-90c9-540a90ab8c46",
   "metadata": {},
   "source": [
    "## Change loader class\n",
    "By default this uses the `UnstructuredLoader` class. To customize the loader, specify the loader class in the `loader_cls` kwarg. Below we show an example using [TextLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.text.TextLoader.html):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "d369ee78-ea24-48cc-9f46-1f5cd4b56f48",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders import TextLoader\n",
    "\n",
    "loader = DirectoryLoader(\"../\", glob=\"**/*.md\", loader_cls=TextLoader)\n",
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "2863d7dd-2d56-4fef-8bfd-95c48a6b4a71",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# Security\n",
      "\n",
      "LangChain has a large ecosystem of integrations with various external resources like loc\n"
     ]
    }
   ],
   "source": [
    "print(docs[0].page_content[:100])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c97ed37b-38c0-4f31-9403-d3a5d5444f78",
   "metadata": {},
   "source": [
    "Notice that while the `UnstructuredLoader` parses Markdown headers, `TextLoader` does not.\n",
    "\n",
    "If you need to load Python source code files, use the `PythonLoader`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "5ef483a8-57d3-45e5-93be-37c8416c543c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders import PythonLoader\n",
    "\n",
    "loader = DirectoryLoader(\"../../../../../\", glob=\"**/*.py\", loader_cls=PythonLoader)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61dd1428-8246-47e3-b1da-f6a3d6f05566",
   "metadata": {},
   "source": [
    "## Auto-detect file encodings with TextLoader\n",
    "\n",
    "`DirectoryLoader` can help manage errors due to variations in file encodings. Below we will attempt to load in a collection of files, one of which includes non-UTF8 encodings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "e69db7ae-0385-4129-968f-17c42c7a635c",
   "metadata": {},
   "outputs": [],
   "source": [
    "path = \"../../../../libs/langchain/tests/unit_tests/examples/\"\n",
    "\n",
    "loader = DirectoryLoader(path, glob=\"**/*.txt\", loader_cls=TextLoader)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e3b61cf0-809b-4c97-b1a4-17c6aa4343e1",
   "metadata": {},
   "source": [
    "### A. Default Behavior\n",
    "\n",
    "By default we raise an error:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "4b8f56be-122a-4c56-86a5-a70631a78ec7",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Error loading file ../../../../libs/langchain/tests/unit_tests/examples/example-non-utf8.txt\n"
     ]
    },
    {
     "ename": "RuntimeError",
     "evalue": "Error loading ../../../../libs/langchain/tests/unit_tests/examples/example-non-utf8.txt",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mUnicodeDecodeError\u001b[0m                        Traceback (most recent call last)",
      "File \u001b[0;32m~/repos/langchain/libs/community/langchain_community/document_loaders/text.py:43\u001b[0m, in \u001b[0;36mTextLoader.lazy_load\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m     42\u001b[0m     \u001b[38;5;28;01mwith\u001b[39;00m \u001b[38;5;28mopen\u001b[39m(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mfile_path, encoding\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mencoding) \u001b[38;5;28;01mas\u001b[39;00m f:\n\u001b[0;32m---> 43\u001b[0m         text \u001b[38;5;241m=\u001b[39m \u001b[43mf\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mread\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m     44\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mUnicodeDecodeError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m e:\n",
      "File \u001b[0;32m~/.pyenv/versions/3.10.4/lib/python3.10/codecs.py:322\u001b[0m, in \u001b[0;36mBufferedIncrementalDecoder.decode\u001b[0;34m(self, input, final)\u001b[0m\n\u001b[1;32m    321\u001b[0m data \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mbuffer \u001b[38;5;241m+\u001b[39m \u001b[38;5;28minput\u001b[39m\n\u001b[0;32m--> 322\u001b[0m (result, consumed) \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_buffer_decode\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43merrors\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mfinal\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m    323\u001b[0m \u001b[38;5;66;03m# keep undecoded input until the next call\u001b[39;00m\n",
      "\u001b[0;31mUnicodeDecodeError\u001b[0m: 'utf-8' codec can't decode byte 0xca in position 0: invalid continuation byte",
      "\nThe above exception was the direct cause of the following exception:\n",
      "\u001b[0;31mRuntimeError\u001b[0m                              Traceback (most recent call last)",
      "Cell \u001b[0;32mIn[10], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mloader\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mload\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n",
      "File \u001b[0;32m~/repos/langchain/libs/community/langchain_community/document_loaders/directory.py:117\u001b[0m, in \u001b[0;36mDirectoryLoader.load\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m    115\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mload\u001b[39m(\u001b[38;5;28mself\u001b[39m) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m List[Document]:\n\u001b[1;32m    116\u001b[0m \u001b[38;5;250m    \u001b[39m\u001b[38;5;124;03m\"\"\"Load documents.\"\"\"\u001b[39;00m\n\u001b[0;32m--> 117\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mlist\u001b[39;49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mlazy_load\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m)\u001b[49m\n",
      "File \u001b[0;32m~/repos/langchain/libs/community/langchain_community/document_loaders/directory.py:182\u001b[0m, in \u001b[0;36mDirectoryLoader.lazy_load\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m    180\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m    181\u001b[0m     \u001b[38;5;28;01mfor\u001b[39;00m i \u001b[38;5;129;01min\u001b[39;00m items:\n\u001b[0;32m--> 182\u001b[0m         \u001b[38;5;28;01myield from\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_lazy_load_file(i, p, pbar)\n\u001b[1;32m    184\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m pbar:\n\u001b[1;32m    185\u001b[0m     pbar\u001b[38;5;241m.\u001b[39mclose()\n",
      "File \u001b[0;32m~/repos/langchain/libs/community/langchain_community/document_loaders/directory.py:220\u001b[0m, in \u001b[0;36mDirectoryLoader._lazy_load_file\u001b[0;34m(self, item, path, pbar)\u001b[0m\n\u001b[1;32m    218\u001b[0m     \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m    219\u001b[0m         logger\u001b[38;5;241m.\u001b[39merror(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mError loading file \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mstr\u001b[39m(item)\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m--> 220\u001b[0m         \u001b[38;5;28;01mraise\u001b[39;00m e\n\u001b[1;32m    221\u001b[0m \u001b[38;5;28;01mfinally\u001b[39;00m:\n\u001b[1;32m    222\u001b[0m     \u001b[38;5;28;01mif\u001b[39;00m pbar:\n",
      "File \u001b[0;32m~/repos/langchain/libs/community/langchain_community/document_loaders/directory.py:210\u001b[0m, in \u001b[0;36mDirectoryLoader._lazy_load_file\u001b[0;34m(self, item, path, pbar)\u001b[0m\n\u001b[1;32m    208\u001b[0m loader \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mloader_cls(\u001b[38;5;28mstr\u001b[39m(item), \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mloader_kwargs)\n\u001b[1;32m    209\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m--> 210\u001b[0m     \u001b[38;5;28;01mfor\u001b[39;00m subdoc \u001b[38;5;129;01min\u001b[39;00m loader\u001b[38;5;241m.\u001b[39mlazy_load():\n\u001b[1;32m    211\u001b[0m         \u001b[38;5;28;01myield\u001b[39;00m subdoc\n\u001b[1;32m    212\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mNotImplementedError\u001b[39;00m:\n",
      "File \u001b[0;32m~/repos/langchain/libs/community/langchain_community/document_loaders/text.py:56\u001b[0m, in \u001b[0;36mTextLoader.lazy_load\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m     54\u001b[0m                 \u001b[38;5;28;01mcontinue\u001b[39;00m\n\u001b[1;32m     55\u001b[0m     \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m---> 56\u001b[0m         \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mRuntimeError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mError loading \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mfile_path\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m) \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01me\u001b[39;00m\n\u001b[1;32m     57\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mException\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m e:\n\u001b[1;32m     58\u001b[0m     \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mRuntimeError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mError loading \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mfile_path\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m) \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01me\u001b[39;00m\n",
      "\u001b[0;31mRuntimeError\u001b[0m: Error loading ../../../../libs/langchain/tests/unit_tests/examples/example-non-utf8.txt"
     ]
    }
   ],
   "source": [
    "loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "48308077-2d99-4dd6-9bf1-dd1ad6c64b0f",
   "metadata": {},
   "source": [
    "The file `example-non-utf8.txt` uses a different encoding, so the `load()` function fails with a helpful message indicating which file failed decoding.\n",
    "\n",
    "With the default behavior of `TextLoader` any failure to load any of the documents will fail the whole loading process and no documents are loaded.\n",
    "\n",
    "### B. Silent fail\n",
    "\n",
    "We can pass the parameter `silent_errors` to the `DirectoryLoader` to skip the files which could not be loaded and continue the load process."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "b333c652-a7ad-47f4-8be8-d27c18ef11b7",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Error loading file ../../../../libs/langchain/tests/unit_tests/examples/example-non-utf8.txt: Error loading ../../../../libs/langchain/tests/unit_tests/examples/example-non-utf8.txt\n"
     ]
    }
   ],
   "source": [
    "loader = DirectoryLoader(\n",
    "    path, glob=\"**/*.txt\", loader_cls=TextLoader, silent_errors=True\n",
    ")\n",
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "b99ef682-b892-4790-8964-40185fea41a2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['../../../../libs/langchain/tests/unit_tests/examples/example-utf8.txt']"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc_sources = [doc.metadata[\"source\"] for doc in docs]\n",
    "doc_sources"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "da475bff-2f4f-4ea3-a058-2979042c5326",
   "metadata": {},
   "source": [
    "### C. Auto detect encodings\n",
    "\n",
    "We can also ask `TextLoader` to auto detect the file encoding before failing, by passing the `autodetect_encoding` to the loader class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "832760da-ed9f-4e68-a67c-35493bde2214",
   "metadata": {},
   "outputs": [],
   "source": [
    "text_loader_kwargs = {\"autodetect_encoding\": True}\n",
    "loader = DirectoryLoader(\n",
    "    path, glob=\"**/*.txt\", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs\n",
    ")\n",
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "5c4f4dba-f84f-496e-9378-3e6858305619",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['../../../../libs/langchain/tests/unit_tests/examples/example-utf8.txt',\n",
       " '../../../../libs/langchain/tests/unit_tests/examples/example-non-utf8.txt']"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc_sources = [doc.metadata[\"source\"] for doc in docs]\n",
    "doc_sources"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
